Closed Bug 625981 Opened 14 years ago Closed 13 years ago

places.sqlite-wal takes too much space for Fennec

Categories

(Firefox for Android Graveyard :: Bookmarks, defect, P2)

ARM
Android

Tracking

(fennec+)

RESOLVED WORKSFORME
Tracking Status
fennec + ---

People

(Reporter: mbrubeck, Unassigned)

References

Details

(Keywords: mobile, regression, Whiteboard: [places-next-wanted])

Attachments

(3 files)

+++ This bug was initially created as a clone of Bug #608422 +++

Using the latest trunk build of Fennec for Android, the places.sqlite-wal file in my profile directory has grown to over 70 MB.  Because disk space is scarce on many mobile devices, we want to prevent this file from growing so large in Fennec.  Bug 608422 has details of how this growth was capped for the cookies WAL file.
Component: Bookmarks & History → Bookmarks
Product: Firefox → Fennec
QA Contact: bookmarks → bookmarks
Version: unspecified → Trunk
tracking-fennec: --- → ?
interesting, we already implemented the WAL limit in bug 609122, so it looks like it's not working?
Could you please debug that code and check what the final value of checkpointPages is on your device?
and actually, after bug 609122, we changed that code so that the max WAL size should be around 512KiB (see #define DATABASE_MAX_WAL_SIZE_IN_KIBIBYTES 512)

It's possible there is something writing a really huge amount of data in a single transaction, but 70MB is really a lot. Based on our code the WAL should auto-checkpoint when it goes over 512KiB.
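For reference, a minimal sketch of the arithmetic behind that cap (the constant is quoted above; passing the page size in as a parameter is an assumption about how Places derives checkpointPages, not a copy of the real code):

    #include <sqlite3.h>

    #define DATABASE_MAX_WAL_SIZE_IN_KIBIBYTES 512

    // Derive the wal_autocheckpoint page count from the 512 KiB cap.
    // With the 32 KiB pages used by places.sqlite this yields 16 pages,
    // matching the checkpointPages value reported later in this bug.
    int SetWalSizeCap(sqlite3 *db, int pageSize) {
      int checkpointPages = DATABASE_MAX_WAL_SIZE_IN_KIBIBYTES * 1024 / pageSize;
      char sql[64];
      sqlite3_snprintf(sizeof(sql), sql,
                       "PRAGMA wal_autocheckpoint = %d;", checkpointPages);
      return sqlite3_exec(db, sql, nullptr, nullptr, nullptr);
    }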
tracking-fennec: ? → 2.0+
Assignee: nobody → pavlov
(In reply to comment #2)
> Could you please debug that code and check what the final value of
> checkpointPages is on your device?

In the debugger I see checkpointPages = 16, as expected.

On my Fennec desktop build, I saw a sudden jump from 1MB to 11MB after re-enabling Sync in a profile that had it disabled for a while.

I am seeing this problem in the desktop Minefield nightly builds on Linux x86 too.  My places.sqlite-wal file for Minefield is currently 36MB.
If I look at my cookies.sqlite-wal it's bigger than 512KiB as well (I've seen it greater than 1.5MB), so I guess the limit we set is somewhat lazy. In our case, if Sync executes a large transaction, the file grows and is not shrunk until the next change. Btw, I also noticed a 30MB WAL file on my desktop profile. But right now we do what the SQLite manual suggests with the pragma.
I see the same growth, with Fennec using 70+MB of storage, which brings my phone to a halt. Restarting (and crash-restarting) Fennec doesn't help - still the same. I was hoping for a cleanup to kick in, or for a way to limit data size to a maximum (while expiring older entries where possible).
marco, wondering about priority - does this 70+mb get read into memory, or mmapped?
(In reply to comment #8)
> marco, wondering about priority - does this 70+mb get read into memory, or
> mmapped?
The WAL file is written and read via shared memory, AFAIK.
Severity: normal → major
Priority: -- → P2
See Also: → 615519
actually there is a specific case where the WAL can grow beyond the configured checkpoint size: if there is a read transaction open, the WAL cannot be checkpointed. But that means we have some longstanding transaction that should be split, and we should figure out which one it is.
Does disabling Sync change anything in the size of the wal?
Assignee: pavlov → nobody
SQLite 3.7.6 will introduce new WAL checkpointing methods, so that it could be possible to force a blocking full checkpoint in a more granular way.
fwiw (maybe a separate bug), the nss sql files also take up 512kb on disk.
(In reply to comment #11)
> SQLite 3.7.6 will introduce new WAL checkpointing methods, so that it could be
> possible to force a blocking full checkpoint in a more granular way.

can you point to a patch or set of patches that implement this so we can see if it fixes the issue?
That version of SQLite is not out yet; it's a draft and could still change, so it's not suggested to build patches on top of it for now.

Does disabling Sync on the device improve the situation? (This can really help us understand whether to look at Places or at Sync first to reduce transaction size.)
Matt, can you answer Marco's question(s) please?
(In reply to comment #14)
> That version of SQLite is not out yet; it's a draft and could still change,
> so it's not suggested to build patches on top of it for now.
> 

We are happy to branch the SQLite source tree, adding this change into a branch for release 3.7.5 (3.7.5.1) if you like.  All you have to do is ask.
(In reply to comment #16)
> We are happy to branch the SQLite source tree, adding this change into a branch
> for release 3.7.5 (3.7.5.1) if you like.  All you have to do is ask.
If that change is well contained, that might be worthwhile.  I don't think we want to take a new version with other changes before we release Firefox 4.0.
(In reply to comment #16)
> (In reply to comment #14)
> > That version of SQLite is not out yet; it's a draft and could still change,
> > so it's not suggested to build patches on top of it for now.
> > 
> 
> We are happy to branch the SQLite source tree, adding this change into a branch
> for release 3.7.5 (3.7.5.1) if you like.  All you have to do is ask.

Has that change landed in the sqlite tree? I'm basically asking if there's something we can do to test if this sqlite feature fixes our problem or not before we get into conversations about whether to take the change or not.
(In reply to comment #14)
> Does disabling Sync on the device improve the situation? (This can really
> help us understand whether to look at Places or at Sync first to reduce
> transaction size.)

In my current profile on Android, the places.sqlite-wal file is about 6MB.  After I disabled Sync and restarted Fennec, the 6MB WAL file is still there.

When I exit Fennec with the "Quit Fennec" add-on, the cookies.sqlite-wal file is deleted, but the places.sqlite-wal file is not deleted.

Next I'll see what happens in a new empty profile with Sync disabled...
Summary: places.sqlite-wal takes too much space for Fennec → places.sqlite-wal takes too much space for Fennec and is not removed on exit
(In reply to comment #18)
> Has that change landed in the sqlite tree? I'm basically asking if there's
> something we can do to test if this sqlite feature fixes our problem or not
> before we get into conversations about whether to take the change or not.
As far as I know, yes, and you can grab it here: http://www.sqlite.org/draft/download.html
In a new empty profile without Sync enabled, the places.sqlite-wal is 1082168 bytes (~1057 KB) on first run.  After about 40 unique page loads, the size of the WAL file has not changed.  As before, exiting Fennec removes cookies.sqlite-wal but not places.sqlite-wal.
(In reply to comment #22)
> Have we looked at reverting the wal mode on mobile to see the perf hit (if
> any)?
If we disable WAL, we need to set PRAGMA synchronous = FULL too, which increases the number of fsyncs we do.  I'm certain that will cause regressions (although we may not have any tests that measure it well).
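For context, a hedged sketch of the pragma pair that disabling WAL would imply (values from this comment; where Places would issue them is an assumption):

    #include <sqlite3.h>

    // If WAL were disabled we would fall back to a rollback journal, and
    // synchronous would have to go back to FULL for the corruption
    // resistance described in the next comment.
    void DisableWal(sqlite3 *db) {
      sqlite3_exec(db, "PRAGMA journal_mode = DELETE;", nullptr, nullptr, nullptr);
      sqlite3_exec(db, "PRAGMA synchronous = FULL;", nullptr, nullptr, nullptr);
    }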
it isn't obvious that synchronous needs to be full on mobile.  I am happy to force syncs when we are about to exit or are in the background.  is there a technical reason that you suggest FULL is needed?
(In reply to comment #24)
> it isn't obvious that synchronous needs to be full on mobile.  I am happy to
> force syncs when we are about to exit or are in the background.  is there a
> technical reason that you suggest FULL is needed?
I'm not suggesting, I'm insisting.  When we used synchronous = NORMAL, we ended up getting a number of bugs that came down to corrupt databases (even on journaled file systems).  Those have almost completely gone away (at the very least, the frequency at which they occur is markedly lower).  The corruption we were seeing wasn't always immediately detectable either, so we couldn't just throw the database away at startup and use Sync to get the data again.

Has anybody tried the build of SQLite I linked to in comment 20?  I recall talking to someone about this (can't recall who, or if it was real life or irc), but figuring out whether the call to sqlite3_close failed or not would also be helpful.

The right solution here isn't to disable WAL, since the issue doesn't seem to have been looked into much at all yet.  A number of recent architectural decisions may have to be revisited if we did go down that road, so the cost of doing that isn't free either.
I wanted to question the blocking status of this bug since Fennec can now move profiles to the sdcard.

Also, we don't seem to have much forward movement in this bug, so the risk of this missing Fennec 4 grows.

Lastly, this bug has no owner.
Is the size of Sync bundles (number of entries added in a single transaction) configurable? You could try to measure benefits of reducing it. The results in comment 21 don't seem that bad.
(In reply to comment #27)
> Is the size of Sync bundles (number of entries added in a single transaction)
> configurable? You could try to measure benefits of reducing it. The results in
> comment 21 don't seem that bad.
comment 21 indicates that the WAL file still isn't going away, which is a bug (and implies that Sync isn't the cause of that problem).
ah, yes, that means there could be some unfinalized statement. Btw, these two bugs together make less sense; there should be one bug for the size and one for the not-removed file, since the causes are obviously different.
(In reply to comment #29)
> ah, yes, that means there could be some unfinalized statement. Btw, these two
> bugs together make less sense; there should be one bug for the size and one
> for the not-removed file, since the causes are obviously different.
Agreed.
Blocks: 636161
I split off the "not removed on exit" into a new bug 636161.
Summary: places.sqlite-wal takes too much space for Fennec and is not removed on exit → places.sqlite-wal takes too much space for Fennec
Assignee: nobody → alexp
tracking-fennec: 2.0+ → 2.0-
there is another issue you would hit by disabling WAL: Ginn Chen had to do that on Solaris, since WAL doesn't work reliably there, and now a lot of thread locks are happening there when using the locationbar. WAL solved a bunch of thread-locking by not blocking readers.
Whiteboard: [fennec-4.1?]
Whiteboard: [fennec-4.1?] → [fennec-4.1?][places-next-wanted]
SQLite 3.7.6 gives us the ability to use blocking WAL checkpoints
tracking-fennec: - → ?
What are the next steps here? We need some forward progress.
Assignee: alexp → azakai
tracking-fennec: ? → 6+
Whiteboard: [fennec-4.1?][places-next-wanted] → [places-next-wanted]
The relevant SQLite interfaces can be seen at here:

    http://www.sqlite.org/c3ref/wal_checkpoint_v2.html
    http://www.sqlite.org/pragma.html#pragma_wal_checkpoint

A "checkpoint" operation is when content in the WAL file is written back into the main database.  Checkpoints normally happen automatically when the WAL file reaches about 1MB in size (and at other times too - the details are not important.)  After a successful checkpoint, the WAL is reset back to a small size.  But a checkpoint cannot run to completion if there is an active concurrent reader, since if it did, the checkpoint would delete content out from under the active readers, which would be very bad for the readers.  So a checkpoint will only run to completion and reset the WAL file if there are no active readers at the point in time when the checkpoint is attempted.  Checkpoints are normally attempted at the conclusion of every write transaction when the WAL file size exceeds threshold (though, of course, you can manually run a checkpoint whenever you want using appropriate APIs.)

The enhancement to SQLite in 3.7.6 is that you can now request that a checkpoint operation block until it is able to run to completion.  This ensures that the WAL file will be reset back to a small size.  The downside, of course, is that whatever thread is doing that checkpoint now blocks until all the readers of the database file finish their current transaction.

I hope this explanation helps you to figure out where blocking checkpoints belong in FF.
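A minimal sketch of what requesting a blocking checkpoint looks like through the C API described above (assuming SQLite 3.7.6; without a busy handler the call returns SQLITE_BUSY rather than waiting, hence the timeout):

    #include <sqlite3.h>

    // Block until readers clear, copy the WAL into the database, and
    // rewind the WAL so subsequent writers reuse it from the start.
    int ForceCheckpoint(sqlite3 *db) {
      sqlite3_busy_timeout(db, 5000);  // wait up to 5s for readers to clear
      int framesInWal = 0, framesCheckpointed = 0;
      return sqlite3_wal_checkpoint_v2(db, "main", SQLITE_CHECKPOINT_RESTART,
                                       &framesInWal, &framesCheckpointed);
    }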
(In reply to comment #34)
> What are the next steps here? We need some forward progress.
We need to call PRAGMA wal_checkpoint when the WAL file gets too big (or perhaps on memory pressure notifications?).  Something is still clearly broken because this is only seen in Fennec.
(In reply to comment #34)
> What are the next steps here? We need some forward progress.

This bug is in the Places plan for FF6 (see https://wiki.mozilla.org/Places:Plan#Mobile_Needs); I hope Sdwilsh can take the leadership here, since I'm driving other work for Sync and perf.
I don't mind who writes the patches, but this bug needs strong leadership.

In the last period I've looked at the Sync code multiple times, and I think there is something there that could cause large read transactions - for example, the first sync of the day reads in all bookmark ids and then runs a bunch of read queries for each id. I'm planning and doing work to solve some of these perf issues by reducing the query traffic generated by Sync.

But we should also update SQLite to 3.7.6 and implement some size check on the WAL to force blocking checkpoints when needed.

(In reply to comment #36)
> (In reply to comment #34)
> Something is still clearly
> broken because this is only seen in Fennec.

No, I see this on my desktop profile; my WAL is currently 32MB while it is usually 768KB, and I suspect it's related to Sync. Indeed, from previous comments it looks like nobody was able to reproduce the problem without Sync.
Having a dump of all the read/write transactions could help figure out the culprit.
I've moved Fennec to the SD card and after that synced the app with my desktop. When the sync completed, places.sqlite-wal was about ~200MB. After that, from Preferences I tapped the Clear button. The Clear Private Data popup was displayed and I tapped OK. Instead of clearing data and disconnecting Sync, the app just froze. Meanwhile I was watching whether places.sqlite-wal was changing size, and in about 7 minutes it reached ~1.7GB, as you can see in the attached screenshot.
Is this issue related to this bug?
so is there something sync and/or "clear private data" can do in the mean time to make this situation better?
probably someone could try changing the Sync values (size of the batches, interval between batches) and test on a real device whether the situation improves.
cc-ing philipp - he could give hints on variables to edit to sync more lazily
another interesting question: when you have a large WAL, if you add a bookmark, does its size shrink?
Attached patch wip test
I started to write a test. Running it on my desktop, I get a WAL file of size 1,082,168 bytes after adding a single bookmark.

The printout I added in that patch implies that checkpointPages is 16, and with the page size of 32K, the max WAL size should be 512K. But in the test it is over 1MB.
So, that implies that somewhere we have an active reader.
SQLite does not necessarily checkpoint after every transaction.  It waits until the WAL file reaches a threshold size and then it tries to checkpoint.  The default value of that threshold is 1MB.  The threshold can be modified using either "PRAGMA wal_autocheckpoint" or sqlite3_wal_autocheckpoint()

  http://www.sqlite.org/pragma.html#pragma_wal_autocheckpoint
  http://www.sqlite.org/c3ref/wal_autocheckpoint.html

Furthermore, the WAL file is not truncated when a checkpoint runs.  Rather, SQLite moves the pointer back to the beginning of the file and starts overwriting it with new content.  On most systems, overwriting an existing file is faster than truncating and appending to a file.  The WAL file is not normally deleted and space reclaimed until the last database connection that is using that file closes.
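A short sketch of the two equivalent ways to change that threshold (the 16-page value is the one Places sets in this bug; the rest is the standard SQLite API):

    #include <sqlite3.h>

    void SetAutoCheckpoint(sqlite3 *db) {
      // C API form: try a checkpoint whenever the WAL exceeds 16 pages
      // (the SQLite default is 1000 pages, about 1MB).
      sqlite3_wal_autocheckpoint(db, 16);
      // Equivalent pragma form, as issued from mozStorage:
      sqlite3_exec(db, "PRAGMA wal_autocheckpoint = 16;", nullptr, nullptr, nullptr);
    }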
Adding several thousand bookmarks, I see the -wal file go through the following sizes:

  1082168, 1344504, 1770800, 1869176, 2295472, 3049688

which, for a page size of 32K, and ignoring a few percent which I assume is overhead, means the number of pages in the WAL file is

  33, 41, 54, 57, 70, 93

all of which are more than the 16 pages we try to allow via PRAGMA wal_autocheckpoint. In other words, it appears PRAGMA wal_autocheckpoint is simply not working.

In general we seem to be off by orders of magnitude from what the SQLite API expects - its default is 1,000 pages compared to our 16. Perhaps they don't care about getting a few dozen pages too many, so the code allows that, but we do care about such things?

(In reply to comment #44)
> So, that implies that somewhere we have an active reader.

Can you please elaborate? I am unsure how to proceed in this bug.
(In reply to comment #45)
> Furthermore, the WAL file is not truncated when a checkpoint runs.  Rather,
> SQLite moves the pointer back to the beginning of the file and starts
> overwriting it with new content.  On most systems, overwriting an existing
> file is faster than truncating and appending to a file.  The WAL file is not
> normally deleted and space reclaimed until the last database connection that
> is using that file closes.

Ah, so if we have a large transaction creating a 100MB WAL, even if we set the autocheckpoint to 1MB, the WAL will never reduce its size. It can only grow?
I think I've never seen this explained in the WAL documentation; I really thought it was being truncated.
This could be part of our problem and explain why, when my WAL grows, it never shrinks. We probably need a way to truncate unused space from that file.
SQLite can only rewind the WAL file and start using it over again from the beginning when it does a checkpoint while there are no simultaneous readers.  If there are concurrent readers when the checkpoint runs, the checkpoint cannot run to completion, or else it would delete content out from under the readers, which makes the readers very unhappy.

So if a WAL is growing without bound, that suggests that there is a reader holding a transaction open someplace, and thus preventing the checkpoint from running to completion.

In SQLite version 3.7.6 you can say:

    PRAGMA wal_checkpoint=RESTART;

Normally, a checkpoint will do as much as it can without disrupting readers, then quit.  But with the =RESTART option, the wal_checkpoint will block until all readers clear and it can rewind back to the beginning of the WAL file, thus preventing the WAL from growing.

If you really do have a reader holding a transaction open, then wal_checkpoint=RESTART will block forever.  If nothing else, that will tell us that the problem really is a hung reader and not something we haven't thought of...
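A self-contained sketch of the diagnosis suggested here: one connection holds a read transaction open while another attempts a RESTART checkpoint (the file name is hypothetical; with no busy handler installed the checkpoint returns SQLITE_BUSY instead of blocking forever):

    #include <sqlite3.h>
    #include <cstdio>

    int main() {
      sqlite3 *writer, *reader;
      sqlite3_open("test.db", &writer);
      sqlite3_open("test.db", &reader);
      sqlite3_exec(writer, "PRAGMA journal_mode = WAL;"
                           "CREATE TABLE IF NOT EXISTS t(x);",
                   nullptr, nullptr, nullptr);

      // BEGIN alone is lazy, so run a SELECT to actually open the
      // read transaction and pin the current WAL snapshot.
      sqlite3_exec(reader, "BEGIN; SELECT count(*) FROM t;",
                   nullptr, nullptr, nullptr);
      sqlite3_exec(writer, "INSERT INTO t VALUES (1);",
                   nullptr, nullptr, nullptr);

      // The reader's snapshot prevents the checkpoint from completing
      // and rewinding the WAL; SQLITE_BUSY here means "hung reader".
      int rc = sqlite3_wal_checkpoint_v2(writer, "main",
                                         SQLITE_CHECKPOINT_RESTART,
                                         nullptr, nullptr);
      printf("checkpoint rc=%d (SQLITE_BUSY is %d)\n", rc, SQLITE_BUSY);

      sqlite3_exec(reader, "COMMIT;", nullptr, nullptr, nullptr);
      sqlite3_close(reader);
      sqlite3_close(writer);
      return 0;
    }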
Thanks, we can do that check with the new SQLite version locally.

What worries me is that we always use our transaction helper, which commits or rolls back a transaction when going out of scope, so there is no way we could forget to close a transaction (unless it leaks, but we would see that).
Sync does not create transactions on places.sqlite.
mozStorageAsyncStatementExecution uses a mozStorageTransaction (it was unable to distinguish write statements, but recent SQLite versions can, and we should fix our code to use that!). It doesn't return before closing the transaction, though, even in case of errors.
I tested with SQLite 3.7.6.2 and PRAGMA wal_checkpoint = RESTART. I don't see any change; the problem remains, and nothing blocks forever.

(Note that there isn't any indication I can see that the command ran successfully - mDBConn->ExecuteSimpleSQL appears to return a successful rv even if I enter spurious data, like PRAGMA wal_checkpoint = CHEEZE. Is there some other way to check, just to make sure?)

I also did a separate check with and without our existing call to PRAGMA wal_autocheckpoint. It looks like it does help significantly - the WAL file grows much, much faster without it. But it looks like a matter of speed, not of enforcing a limit.

P.S. I'm doing these checks in an xpcshell test. Is there any reason SQLite behavior/transactions/etc. would be different there than normally?
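On the "some other way to check" question: in 3.7.6 the wal_checkpoint pragma is expected to return a result row (busy flag, frames in the WAL, frames checkpointed) that a prepared statement can read, which ExecuteSimpleSQL's rv does not surface. A hedged sketch against the raw SQLite API, assuming that row-returning behavior:

    #include <sqlite3.h>
    #include <cstdio>

    // Step "PRAGMA wal_checkpoint" as a statement to see whether the
    // checkpoint actually completed, rather than trusting the rv alone.
    void ReportCheckpoint(sqlite3 *db) {
      sqlite3_stmt *stmt = nullptr;
      if (sqlite3_prepare_v2(db, "PRAGMA wal_checkpoint = RESTART;", -1,
                             &stmt, nullptr) != SQLITE_OK)
        return;
      if (sqlite3_step(stmt) == SQLITE_ROW) {
        printf("busy=%d walFrames=%d checkpointed=%d\n",
               sqlite3_column_int(stmt, 0),
               sqlite3_column_int(stmt, 1),
               sqlite3_column_int(stmt, 2));
      }
      sqlite3_finalize(stmt);
    }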
Notice we also call "pragma wal_checkpoint" after most of the bookmark changes, like insertBookmark; maybe you could also check how returning early there (thus skipping the forced checkpoint) influences growth (http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/Helpers.cpp#358)

xpcshell should not change things significantly, apart from the fact that in the browser you have more query traffic, while here you can define the traffic you want.
Are you testing with current Trunk code?
(In reply to comment #52)
> Notice we also call "pragma wal_checkpoint" after most of the bookmark
> changes, like insertBookmark; maybe you could also check how returning
> early there (thus skipping the forced checkpoint) influences growth
> (http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/
> Helpers.cpp#358)
> 

I don't see a difference with that.

> Are you testing with current Trunk code?

Yeah, this is m-c from last Friday.
Does SQLite avoid truncation in general? I notice that when we clear private data (all stored in SQLite files), the data seems to be gone, but the file sizes stay the same.
the database cannot shrink unless you execute a VACUUM (we do on idle-daily, once a month)
We probably should VACUUM on shutdown for fennec.
Why don't we let users choose when they want to clear their databases? I think that is too much (for example, keeping 100 MB of data for a month without clearing anything is not really ok)
(In reply to comment #57)
> We probably should VACUUM on shutdown for fennec.
Not unless you want shutdown to take longer.  Maybe that doesn't matter to mobile, but that sure does matter to desktop.

(In reply to comment #58)
> Why don't we let users choose when they want to clear their databases? I
> think that is too much (for example, keeping 100 MB of data for a month
> without clearing anything is not really ok)
I think what you want is bug 643254 and is unrelated to this bug.
(In reply to comment #59)
> (In reply to comment #57)
> > We probably should VACUUM on shutdown for fennec.
> Not unless you want shutdown to take longer.  Maybe that doesn't matter to
> mobile, but that sure does matter to desktop.
I don't think we care much about shutdown time, on Android at least
We are again building ideas on vapor. What makes you think the problem is vacuum? And why discuss it in a bug about the WAL?
From my point of view, vacuuming at each shutdown would be a pretty bad choice. The only relevant use for vacuum is after you have cleared all history; otherwise vacuuming once a month can be more than enough, and we can eventually make the interval customizable. If mobile never fires idle-daily and needs to run it at another time, that's another bug that should be filed.
But please don't overlap bugs.
There is this WAL issue to fix, and there is the question of whether Sync is pushing too much data to mobile (which causes the db to grow, and then the data is practically immediately expired)
Indeed, Sync is pushing too much data, but I think the same thing happens, just more slowly, when users do a lot of browsing without clearing any data (and without performing Sync).
(In reply to comment #61)
> We are again building ideas on vapor. What makes you think the problem
> is vacuum? And why discuss it in a bug about the WAL?

Very true. I am opening a new bug to look at doing a vacuum when a user does a "clear private data"
I did some debugging in the sqlite3 code and found the following:

1. nsNavBookmarks Clone()s a connection. This ends up calling openDatabase() in sqlite3, which applies the default wal_autocheckpoint value of 1,000. In other words, we set wal_autocheckpoint to 16, but Clone()s of that database appear to not inherit it.

Changing this does not fix the problem entirely, though, since there is a second, unrelated issue:

2. sqlite *does* attempt to flush the WAL every 16 pages, as we requested. However, it doesn't always succeed :( If this happens several times in a row, data accumulates, and the WAL file is enlarged if necessary. And the WAL file is never shrunk. So one unlucky run of failures leads to a large, very long-lived WAL file.

The failures in my tests are all caused by sqlite3BtreeCheckpoint hitting pBt->inTransaction!=TRANS_NONE. A sqlite comment says that happens when "this or any other connection has an open transaction on the shared-cache the argument Btree is connected to."
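A hedged sketch of the obvious workaround for finding 1, pending a proper fix in mozStorage (the helper name is hypothetical; the Clone signature is abbreviated from mozIStorageConnection):

    #include "mozIStorageConnection.h"
    #include "nsString.h"

    // Hypothetical helper: after cloning, re-apply the WAL cap, since the
    // clone otherwise comes up with SQLite's default of 1000 pages.
    nsresult CloneWithWalCap(mozIStorageConnection *aDB,
                             mozIStorageConnection **aClone) {
      nsresult rv = aDB->Clone(true /* read-only */, aClone);
      NS_ENSURE_SUCCESS(rv, rv);
      return (*aClone)->ExecuteSimpleSQL(
          NS_LITERAL_CSTRING("PRAGMA wal_autocheckpoint = 16"));
    }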
(In reply to comment #64)
> 1. nsNavBookmarks Clone()s a connection. This ends up calling openDatabase()
> in sqlite3, which applies the default wal_autocheckpoint value of 1,000. In
> other words, we set wal_autocheckpoint to 16, but Clone()s of that database
> appear to not inherit it.

We have other clones too (Autocomplete). I have a patch in my queue (bitrotted now, but it can easily be fixed) that opens clones at startup, setting pragmas on all of them. Notice, however, that our cloned connections are read-only, so they are not going to write anything to the WAL. I think nothing should change even after fixing this.

> 2. sqlite *does* attempt to flush the WAL every 16 pages, as we requested.
> However, it doesn't always succeed :( If this happens several times in a
> row, data accumulates, and the WAL file is enlarged if necessary. And the
> WAL file is never shrunk. So one unlucky run of failures leads to a large,
> very long-lived WAL file.

This is known; that's why we ask for more checkpoints than usual (like after any bookmark operation or batch).

> The failures in my tests are all caused by sqlite3BtreeCheckpoint hitting
> pBt->inTransaction!=TRANS_NONE. A sqlite comment says that happens when
> "this or any other connection has an open transaction on the shared-cache
> the argument Btree is connected to."

this is comment 48, but I'd like to see what is creating that transaction :(
In the next release of SQLite (3.7.7), the WAL file will honor the "PRAGMA journal_size_limit" setting.  This means that if you set a journal_size_limit, the WAL file might grow larger than that temporarily, but it will truncate back down to the journal_size_limit as soon as it can, rather than persisting at the larger size forever.

Dunno if that helps or not...
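If the upgrade happens, adopting this could be as small as the following sketch in the Places style (the helper name is hypothetical; 524288 bytes mirrors the existing 512 KiB cap):

    #include "mozIStorageConnection.h"
    #include "nsString.h"

    // Ask SQLite (3.7.7+) to truncate the WAL back toward 512 KiB at the
    // first opportunity instead of leaving it at its high-water mark.
    nsresult SetJournalSizeLimit(mozIStorageConnection *aDB) {
      return aDB->ExecuteSimpleSQL(
          NS_LITERAL_CSTRING("PRAGMA journal_size_limit = 524288"));
    }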
> > The failures in my tests are all caused by sqlite3BtreeCheckpoint hitting
> > pBt->inTransaction!=TRANS_NONE. A sqlite comment says that happens when
> > "this or any other connection has an open transaction on the shared-cache
> > the argument Btree is connected to."
> 
> this is comment 48, but I'd like to see what is creating that transaction :(

Yes, that is what I will find out next.


> In the next release of SQLite (3.7.7), the WAL file will honor the "PRAGMA
> journal_size_limit" setting.  This means that if you set a
> journal_size_limit, the WAL file might grow larger than that temporarily,
> but it will truncate back down to the journal_size_limit as soon as it can,
> rather than persisting at the larger size forever.
> 
> Dunno if that helps or not...

That sounds like it will be very helpful! :)

When we can use that option, we definitely should.
(In reply to comment #64)
> 1. nsNavBookmarks Clone()s a connection. This ends up calling openDatabase() in
> sqlite3, which applies the default wal_autocheckpoint value of 1,000. In other
> words, we set wal_autocheckpoint to 16, but Clone()s of that database appear to
> not inherit it.
We should fix mozIStorageConnection::Clone to also set this pragma to the other database's value.  Can you file a bug for that?

(In reply to comment #67)
> When we can use that option, we definitely should.
Probably for Firefox 7, but we should open a different bug for that too.
Depends on: 658303
(In reply to comment #68)
> (In reply to comment #64)
> > 1. nsNavBookmarks Clone()s a connection. This ends up calling openDatabase() in
> > sqlite3, which applies the default wal_autocheckpoint value of 1,000. In other
> > words, we set wal_autocheckpoint to 16, but Clone()s of that database appear to
> > not inherit it.
> We should fix mozIStorageConnection::Clone to also set this pragma to the
> other database's value.  Can you file a bug for that?

Filed bug 658303.

> 
> (In reply to comment #67)
> > When we can use that option, we definitely should.
> Probably for Firefox 7, but we should open a different bug for that too.

Filed bug 658305.
The major cause of the failures to checkpoint the WAL file is that SQLite, in sqlite3BtreeCommitPhaseTwo, will convert a write lock to a read lock. That read lock is sometimes still there when we try to write later.

(The SQLite code comment says "If there are no active cursors, it also releases the read lock", however, I don't see where that can happen, to my naive eye it looks like it always leaves a read lock unless an error occurs.)

So it isn't that we have an active reader that prevents checkpointing the WAL file, instead we have a writer that keeps a read lock.
(In reply to comment #70)
> The major cause of the failures to checkpoint the WAL file is that SQLite,
> in sqlite3BtreeCommitPhaseTwo, will convert a write lock to a read lock.
> That read lock is sometimes still there when we try to write later.
> 
> (The SQLite code comment says "If there are no active cursors, it also
> releases the read lock", however, I don't see where that can happen, to my
> naive eye it looks like it always leaves a read lock unless an error occurs.)

   http://www.sqlite.org/src/artifact/975ad691a5?ln=3180-3181

I think this comment is misleading you.  For one, that comment was written in 2007, whereas WAL mode was added in 2010.  The comment makes more sense if you are thinking in terms of rollback mode. The comment is still technically correct, but it needs to be understood through the world-view of the btree layer, which has its own idea of locking that is distinct from what is happening in the lower-level pager layer.  The read lock is released in unlockBtreeIfUnused(), which is called from btreeEndTransaction().

   http://www.sqlite.org/src/artifact/975ad691a5?ln=2433-2451
   http://www.sqlite.org/src/artifact/975ad691a5?ln=3151
   http://www.sqlite.org/src/artifact/975ad691a5?ln=3205

> 
> So it isn't that we have an active reader that prevents checkpointing the
> WAL file, instead we have a writer that keeps a read lock.

I'll check on this by writing some new test cases to see if I can induce SQLite in WAL mode to somehow block a checkpoint just by keeping a database connection open.  We use a virtual filesystem for a lot of our testing (the better to simulate I/O errors and system crashes with), so perhaps I can add some assert()s or something to the virtual filesystem that will alert us if WAL locks linger unnecessarily.  I'll also see if I can't come up with some kind of diagnostic aids to help pinpoint exactly why checkpoints are not happening as expected.
> I'll check on this by writing some new test cases to see if I can induce
> SQLite in WAL mode to somehow block a checkpoint just by keeping a database
> connection open.  We use a virtual filesystem for a lot of our testing (the
> better to simulate I/O errors and system crashes with), so perhaps I can add
> some assert()s or something to the virtual filesystem that will alert us if
> WAL locks linger unnecessarily.  I'll also see if I can't come up with some
> kind of diagnostic aids to help pinpoint exactly why checkpoints are not
> happening as expected.

That sounds great, thanks!
Adding bug 658135 to the deps; while it may not have an immediate valuable impact, it should remove some useless transactions (like those caused by autocomplete).
Depends on: 658135
TC: https://litmus.mozilla.org/single_result.cgi?id=427800 fails due to this bug. I cannot sync bookmarks on mobile from desktop.

Mozilla/5.0 (Android; Linux armv7l; rv:6.0a1) Gecko/20110520 Firefox/6.0a1
TC https://litmus.mozilla.org/single_result.cgi?id=427259 fails due to this bug. Sync cannot be completed on some devices with a large account.
http://litmus.mozilla.org/single_result.cgi?id=427175 failed due to this one. 
(sorry for all the mails)
(In reply to comment #76)
> Also https://litmus.mozilla.org/single_result.cgi?id=427344 fails due to this
> bug.
Unlikely given that that only relates to cookies.
Attached image data usage
I have an HTC Inspire (AT&T) and have been running 4.0 and then 4.0.1, using Sync without issue. Today I downloaded the beta and set up Sync, and the data usage for Firefox rocketed up over 700MB. I uninstalled both the beta and 4.0.1. I installed the beta again and set up Sync, and data usage again went over 800MB. So I installed 4.0.1 again and set up Sync and didn't have that problem. After my initial sync it was only using a total of 29MB.

I also repeated this on a Droid2 using Nightly and 4.0.1 and got the same result: Nightly used 800MB after syncing and 4.0.1 used 29MB.
Depends on: 660036
(In reply to comment #80)
> Nightly used 800MB after syncing and 4.0.1 used 29MB.

I see this too - the WAL file growth seems much worse in Firefox 5 and later than it was in Firefox 4.  Filed bug 660036 for this regression.
tracking-fennec: 6+ → 5+
tracking-fennec: 5+ → 6+
D. Richard Hipp  --  any update?
tracking-fennec: 6+ → 7+
I think this bug is a duplicate of 660036 (https://bugzilla.mozilla.org/show_bug.cgi?id=660036), which has now been fixed.  No?
The sudden and huge increase in bug 660036 has been fixed, but there is still a slower, smaller increase over time.  Ideally for mobile we would like a way to put a hard cap on the WAL size, or at least to be able to truncate it periodically, to prevent it from growing monotonically.
We just have to wait for SQLite 3.7.7, with the WAL obeying the journal size limit.  A small increase over time is not as bad as the already-fixed bug (which should have a lot more positives than just WAL size).
Also, for the upgrade we need to figure out the failure in bug 610789, since trying to upgrade to SQLite 3.7.6 caused that test to go permaorange.
SQLite 3.7.7 is due out before the end of June.  Are you able to put a beta copy through your tests?  You can download a beta from http://www.sqlite.org/draft/download.html

Note at http://www.sqlite.org/checklists/3070700, item 22, that we have built the FF nightly using the SQLite 3.7.7 beta and it seems to be working great for us.  But all we do is a basic smoke test - start using FF and verify that it seems to work without crashing.  We don't have the extensive test suite that you do.  Also, our test is on a Linux workstation, not on Android.

What else can the SQLite team do to help resolve the bug 610789 problem?
(In reply to comment #89)
> SQLite 3.7.7 is due out before the end of June.  Are you able to put a beta
> copy through your tests?  You can download a beta from
> http://www.sqlite.org/draft/download.html

Sure, I'll push an upgrade to Tryserver and check tests.

> What else can the SQLite team do to help resolve the bug 610789 problem?

I think it's some race condition on our side; that test is already randomly failing, so we should figure out the original failure reason first of all.
The SQLite 3.7.7 beta was fine on tryserver for all tests but bug 610789. Nothing new there. While 4 runs failed silently, one reported this:

WARNING: SQLite returned error code 17
Error code 17 should be
#define SQLITE_SCHEMA      17   /* The database schema changed */

This happens when it tries to open a connection with the "quota" vfs. There may even be 2 different failures :(
(In reply to comment #91)
> The SQLite 3.7.7 beta was fine on tryserver for all tests but bug 610789.
> Nothing new there. While 4 runs failed silently, one reported this:
> 
> WARNING: SQLite returned error code 17
> Error code 17 should be
> #define SQLITE_SCHEMA      17   /* The database schema changed */
> 
> This happens when it tries to open a connection with the "quota" vfs. There
> may even be 2 different failures :(
Which test failed?  Should we take this discussion to the bug to update SQLite?
(In reply to comment #92)
> (In reply to comment #91)
> > The SQLite 3.7.7 beta was fine on tryserver for all tests but bug 610789.

> Which test failed?  Should we take this discussion to the bug to update
> SQLite?

it's written just above; the random failure is already blocking the upgrade-to-3.7.6 bug.
tracking-fennec: 7+ → +
Is there a bug to track the SQLite 3.7.7 update?
Progress looks good in bug 666240. Once that lands, is there something left to do in this bug? Or is that it?
Assignee: azakai → nobody
(I meant bug 666420.)
yes, we should enforce a journal_size_limit pragma and maybe change checkpoints to be blocking
Depends on: 658305
Depends on: 698102
this should have been solved for quite some time now, by bug 658305.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WORKSFORME